Summary: Metabolic reprogramming and cell death evasion are hallmarks of many cancers, implicating the mitochondria in carcinogenesis. Studies of mutations in the mitochondria’s miniature circular genome have identified cancer-specific mutation patterns that may be relevant to cancer initiation and progression.
Mitochondria are the power generators of the cell: a system of proteins embedded in the inner membrane generates ATP, the universal energy currency of the cell. In addition to their canonical role in cell metabolism, they are involved in myriad other processes, including inflammation and apoptosis (programmed cell death). Cancer cells are characterized by high energy requirements, unchecked proliferation, and avoidance of cell death controls, so mitochondria are hypothesized to play a central role in the genesis and progression of cancer (Hertweck and DesDasgupta (2017); Brandon, Baldi, and Wallace (2006); Zong, Rabinowitz, and White (2016); Hanahan and Weinberg (2011)).
The mitochondrion is believed to have originated from a bacterium that was engulfed by a pre-eukaryotic cell more than a billion years ago. The mitochondrion possesses its own small genome, about 16.6 kb long, encoding 13 proteins that form part of the cellular respiration machinery of the organelle. The genome also encodes 22 transfer RNAs and 2 ribosomal RNAs (Hertweck and DesDasgupta (2017)). Mutations in the mitochondrial genome have been associated with many cancer types, although the mechanism of action of many of these mutations is unclear.
In the nuclear genome, there is remarkable heterogeneity in common and rare variants between genes and across cancer types. More than 90% of mutations in The Cancer Genome Atlas (TCGA) are singletons, meaning they appear only once across the TCGA sequenced tumors. Previous research used nonparametric Bayesian methods developed in the fields of ecology and computational linguistics to mine cancer-specific signals from the ‘hidden genome’ of previously unseen mutations (Chakraborty et al. (2019)). This research represented a novel way of examining mutation in human cancer by focusing on the mutations that have yet to be documented.
High depth, multi-sample molecular characterizations of the mitochondrial genome have illustrated a number of parallels between mtDNA and nuclear DNA in relation to their mutational landscapes (Yuan et al. (2020)). Mutational hotspots (which have been well documented in the nuclear genome) were observed in the regulatory D loop and the ND4 gene. ND5 was the most frequently mutated gene, ND4 was most frequently mutated in prostate and lung cancers, and COX1 was most frequently mutated in breast, cervical, and bladder cancers. Yuan et al. identified that truncating mutations were enriched in kidney, colorectal, and thyroid cancers, while copy number variants showed remarkable heterogeneity between cancers. In summary, like the nuclear genome, the mitochondrial genome contains an exotic and rich diversity of mutations between genes and between cancer types.
This project aims to survey the landscape of single nucleotide variants in the cancer mitochondrial genome, with a special emphasis on applying Bayesian nonparametric methods to estimate previously unseen variant signals in the mitochondrial genome. About 5% of global cancer diagnoses are of unknown tissue of origin and are associated with poor health outcomes; a systematic, data-driven effort to uncover the mutational hallmarks of different cancers in the mitochondrial genome could help improve attempts at classifying cancers based on their mutation profiles.
Summary: We analyzed SNV data from 2177 tumor samples spanning 22 tissues from The Cancer Mitochondria Atlas. We used a table scraped from MitoMap to map variants to particular genomic features.
There has been limited research on the frequencies of rare variation within the mitochondrial genome across several tissue types. Yuan et al. created The Cancer Mitochondria Atlas (TCMA), an open-access web portal to support research and education surrounding mitochondrial cancer genetics. The portal contains open source somatic variant, copy number, nuclear transfer data, and a coexpression dataset. This project will restrict its focus to analysis of somatic variant mutations.
The TCMA somatic variant data contain 7611 instances of somatic variant mutations across 2177 tumor samples spanning 36 primary sites and histologies. The data include ‘sample_id’, the unique identifier for each patient/tumor sample. The ‘cancer_type’ column denotes the tissue of origin and histology of each cancer. The ‘chrom’ column is always ‘MT’, indicating that all data come from the mitochondrial genome. ‘position’ denotes the locus of the mutation along the 16.6 kb mitochondrial genome. ‘ref’ and ‘var’ denote the unmutated reference nucleotide and the mutated variant nucleotide, respectively. ‘var_type’ denotes the kind of mutation: noncoding, synonymous, nonsynonymous, etc. We generated a few additional variables, chiefly ‘SBS’, which denotes the direction of the nucleotide change, and ‘n_tumor’, which denotes the number of tumors of that particular cancer_type in the dataset.
Summary: We propose applying nonparametric statistical methods developed in the fields of computational linguistics and statistical ecology to mutations in the mitochondrial genome. These methods estimate the quantities and probabilities of hitherto unseen variants.
Alexander Corbet’s study of the richness of Malayan butterfly species sparked the development of statistical methods to solve the ‘Unseen Species Problem’.
In the early 1940s, naturalist Alexander Corbet spent two years in Malaya trapping butterflies. At the end of his time in Malaya, Corbet constructed a table of species frequencies, denoting how many species he captured once, twice, etc.
| Frequency | Species |
|---|---|
| 1 | 118 |
| 2 | 74 |
| 3 | 44 |
| 4 | 24 |
| 5 | 29 |
| … | … |
| 15 | 6 |
When Corbet returned to England, he asked R.A. Fisher to estimate the number of new species he would capture if he spent an additional 2 years trapping butterflies. This question concerns the entry of the table with frequency 0 (the unseen species in the population of Malayan butterflies), for which no data were available. Fisher provided an elegant and simple answer to Corbet’s question, an alternating sum of the species frequencies: \[\textrm{\# new species in 2 more years of trapping}=118-74+44-24+\dots=75\] Good and Toulmin extended Fisher’s work with a surprisingly simple formula for the number of novel species one would capture in a new survey of \(t\) times the original survey effort, although the estimate is only reliable for \(t \le 1\). Recognizing the limitations of these earlier estimators, Orlitsky and colleagues identified a family of estimators that provably predicts the number of novel species all the way up to \(t \propto \log{n}\), where \(n\) is the size of the original sample (Orlitsky, Theertha, and Wu (2016)). This family of estimators truncates the alternating series at a random location \(L\) and then averages over the distribution of \(L\). The bias of a truncated estimator depends on whether the series is cut off after an addition or a subtraction term, so averaging over many cutoff locations helps control the bias.
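As a rough illustration of the estimators described above, the following Python sketch computes the plain Good-Toulmin alternating series and a smoothed version that averages over a random cutoff. It is not the variantprobs implementation. The function names, the binomial choice for the cutoff distribution, and the parameters `k` and `q` are assumptions made for this example. The counts for frequencies 6 through 14 are elided in the table above; the values used here come from the commonly reproduced version of Corbet's table and should be verified against the original data.

```python
from math import comb

# Frequencies of frequencies from Corbet's butterfly table.
# ASSUMPTION: rows 6-14 are not shown in the table above; these values are
# from the commonly reproduced version of the data and should be verified.
CORBET = {1: 118, 2: 74, 3: 44, 4: 24, 5: 29, 6: 22, 7: 20, 8: 19,
          9: 20, 10: 15, 11: 12, 12: 14, 13: 6, 14: 12, 15: 6}


def good_toulmin(freq_counts, t):
    """Plain Good-Toulmin series: U(t) = -sum_i (-t)^i * N_i."""
    return -sum(((-t) ** i) * n_i for i, n_i in freq_counts.items())


def binomial_tail(k, q, i):
    """P(L >= i) for a cutoff L ~ Binomial(k, q)."""
    return sum(comb(k, j) * q ** j * (1 - q) ** (k - j)
               for j in range(i, k + 1))


def smoothed_good_toulmin(freq_counts, t, k, q):
    """Average the truncated series over a random cutoff L ~ Binomial(k, q).

    Truncating at L and averaging is the same as weighting the i-th term
    by P(L >= i), which damps the high-order terms that blow up for t > 1.
    """
    return -sum(((-t) ** i) * binomial_tail(k, q, i) * n_i
                for i, n_i in freq_counts.items())


if __name__ == "__main__":
    # t = 1: a second survey with the same effort as the first.
    print(good_toulmin(CORBET, t=1))
    # t = 2 with an illustrative (not tuned) cutoff distribution.
    print(smoothed_good_toulmin(CORBET, t=2, k=8, q=1 / 3))
```

With `t = 1` the plain series reduces to Fisher's alternating sum. The values of `k` and `q` in the second call are placeholders; Orlitsky and colleagues derive specific choices that depend on \(t\) and \(n\).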
I applied the Smoothed Good-Toulmin estimator available through the R package variantprobs, and also computed the Chao estimate of the total number of unseen variants from the training mutation frequencies.
Suppose, like Alexander Corbet, you are going on a butterfly-catching expedition. After you put on your boots and sun hat and pick up your binoculars, you finally grab your notebook, a catalogue of the butterfly species you have encountered on previous expeditions. In your book, there is a list of butterfly species along with the number of times you have caught each species (as in the table above).
Here, \(N_1\) denotes the number of species you caught once, \(N_2\) the number of species you caught twice, etc. But, like Corbet, you haven’t caught every butterfly in your field guide yet. There are some butterflies that remain unseen in your population. Instead of estimating how many butterfly species are unseen in the population, you are interested in the probability that you capture one of these rare butterfly species on your morning expedition.
The probability that the next butterfly we catch belongs to a species previously observed \(r\) times is estimated as \(P(\textrm{next object was seen } r \textrm{ times})=\frac{(r+1)N_{r+1}}{N}\), where \(N=\sum_{r} rN_r\) is the total number of butterflies caught (Gale and Hill (1995)). Intuitively, this formula estimates the rate at which never-seen items become seen-once items, seen-once items become seen-twice items, etc. The formula makes no assumptions about the underlying probability distribution, but rather relies on the empirical training data.
In the butterfly example, \(P(\textrm{next butterfly is an unseen species } (r=0))=\frac{(0+1)N_{0+1}}{N}=\frac{N_{1}}{\sum_{r=1}^\infty rN_r}=\frac{118}{118+2(74)+\dots+15(6)}\).
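For intuition about the Chao estimate, here is a minimal Python sketch of the classical Chao1 lower bound on the number of unseen classes, which uses only the singleton and doubleton counts. Whether variantprobs uses exactly this form is an assumption, and the function name is hypothetical.

```python
def chao1_unseen(n_singletons, n_doubletons):
    """Classical Chao1 lower bound on the number of unseen classes.

    Uses f1^2 / (2 * f2); when there are no doubletons, falls back to the
    bias-corrected form f1 * (f1 - 1) / 2.
    """
    if n_doubletons > 0:
        return n_singletons ** 2 / (2 * n_doubletons)
    return n_singletons * (n_singletons - 1) / 2


if __name__ == "__main__":
    # Corbet's table: 118 species seen once, 74 species seen twice.
    print(chao1_unseen(118, 74))
```

The estimate grows with the ratio of singletons to doubletons: many singletons relative to doubletons suggest that a large part of the population has not been sampled yet.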
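The unsmoothed Good-Turing formula can be sketched in a few lines of Python. The toy catch record and the function name below are invented for illustration and are not part of the project's data.

```python
from collections import Counter


def good_turing_mass(species_counts):
    """Estimate, for each r, the probability that the next catch belongs
    to a species previously seen r times: (r + 1) * N_{r+1} / N."""
    freq_of_freq = Counter(species_counts.values())  # N_r
    total = sum(species_counts.values())             # N = sum_r r * N_r
    max_r = max(freq_of_freq)
    return {r: (r + 1) * freq_of_freq.get(r + 1, 0) / total
            for r in range(max_r + 1)}


if __name__ == "__main__":
    # Hypothetical notebook: species name -> number of times caught.
    notebook = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
    for r, p in good_turing_mass(notebook).items():
        print(r, round(p, 3))
```

In this toy record, the probability assigned to unseen species is \(N_1/N = 3/10\). The species seen most often receives probability 0 because no species was seen one more time than it, which is the weakness of the unsmoothed formula.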
However, the sparseness of the \(N_r\) counts for large \(r\) often leads to poor performance. In the butterfly scenario, suppose you caught 6 species 15 times each, but no species 16 times. Then \(N_{16}=0\), and the formula gives \(P(\textrm{next butterfly is a species seen 15 times})=\frac{16N_{16}}{N}=0\), even though this group clearly should be assigned nonzero probability. We resolve this by revising the probability estimation formula to \(P(\textrm{next object was seen } r \textrm{ times})=\frac{(r+1)S(N_{r+1})}{N}\), where \(S(\cdot)\) denotes a smoothed version of the raw \(N_r\). A simple and effective choice is to first compute \(Z_r = \frac{N_r}{0.5(t-q)}\), where \(q < r < t\) are consecutive frequencies such that \(N_q, N_r, N_t\) are all nonzero, and then fit a simple linear regression of \(\log Z_r\) on \(\log r\), from which smoothed values of \(N_r\) can be imputed (Gale and Hill (1995)).
First developed by Alan Turing and later published by I. J. Good, the Good-Turing technique originally served as a tool to decrypt messages in World War II. Tasks in computational linguistics such as spelling correction, sense disambiguation, and translation often employ Good-Turing estimation to determine the probabilities of encountering previously unseen words, which improves performance. In recent years, Good-Turing frequency estimation has also been used in bioinformatics.
Normalized mutual information (NMI) is an information-theoretic measure of the dependence between two random variables: \(NMI(X,Y)=\frac{MI(X,Y)}{\sqrt{H(X)H(Y)}}\). It is akin to the Pearson correlation coefficient but is able to detect nonlinear relationships. The mutual information \(MI(X,Y)\) is calculated from the marginal and conditional Shannon entropies and can be interpreted as the reduction in uncertainty about \(X\) achieved by observing \(Y\); dividing by \(\sqrt{H(X)H(Y)}\) places \(NMI\) on the [0,1] scale. In this project, it is used to quantify the tissue specificity of mutation probabilities.
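The smoothing step can be sketched as follows. This is a simplified illustration rather than a full Simple Good-Turing implementation. The boundary conventions (\(q=0\) for the smallest observed frequency and \(t=2r-q\) for the largest) follow my reading of Gale's description and should be treated as assumptions, as should all function names.

```python
from math import exp, log


def z_transform(freq_of_freq):
    """Compute Z_r = N_r / (0.5 * (t - q)) over the nonzero N_r.

    q and t are the nearest lower and upper frequencies with nonzero counts.
    ASSUMPTION: q = 0 for the first frequency, t = 2r - q for the last.
    """
    rs = sorted(r for r, n in freq_of_freq.items() if n > 0)
    z = {}
    for idx, r in enumerate(rs):
        q = rs[idx - 1] if idx > 0 else 0
        t = rs[idx + 1] if idx + 1 < len(rs) else 2 * r - q
        z[r] = freq_of_freq[r] / (0.5 * (t - q))
    return z


def fit_log_log(z):
    """Least-squares fit of log(Z_r) = intercept + slope * log(r)."""
    xs = [log(r) for r in z]
    ys = [log(v) for v in z.values()]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def smoothed_count(r, intercept, slope):
    """Impute a smoothed N_r from the fitted line, defined for every r >= 1."""
    return exp(intercept + slope * log(r))


if __name__ == "__main__":
    counts = {1: 10, 2: 4, 5: 2}  # hypothetical N_r with gaps at r = 3, 4
    intercept, slope = fit_log_log(z_transform(counts))
    print(smoothed_count(3, intercept, slope))  # nonzero despite N_3 = 0
```

Because the fitted line is defined for every \(r\), frequencies with \(N_r = 0\) receive a small positive smoothed count instead of zero.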
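A minimal Python sketch of NMI for two discrete label vectors is given below. It uses the identity \(MI(X,Y)=H(X)+H(Y)-H(X,Y)\), which is equivalent to the conditional-entropy form. How continuous mutation probabilities were discretized in the project is not specified, so the paired-label input format is an assumption.

```python
from collections import Counter
from math import log2, sqrt


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def normalized_mutual_information(xs, ys):
    """NMI(X, Y) = MI(X, Y) / sqrt(H(X) * H(Y)) for paired discrete labels."""
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x == 0 or h_y == 0:
        return 0.0  # a constant variable carries no information
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    return (h_x + h_y - h_xy) / sqrt(h_x * h_y)


if __name__ == "__main__":
    tissue = ["liver", "liver", "kidney", "kidney"]
    variant = ["v1", "v1", "v2", "v2"]   # perfectly tissue-specific
    print(normalized_mutual_information(variant, tissue))
```

Perfectly dependent labels give an NMI of 1, and independent labels give 0.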
Summary: Consistent with previous research, we observe extreme strand bias in mutation signatures, suggesting that failure of mitochondria-specific DNA replication/repair mechanisms may be responsible for mutations. Analysis of mutation burden along the mitochondrial genome shows hotspot mutations in the regulatory D loop and the ND4 gene. The regulatory D loop (Start: 16024, End: 576) had the highest mutation rate per locus of any region of the mitochondrial genome. Singleton variants represent the majority of mutations, although a considerably smaller fraction than in the nuclear genome. We estimate that future sequencing efforts will yield thousands to tens of thousands of new variants. Many tissue-specific hotspot mutations were identified, while previously unseen mutation probabilities were found not to be tissue-specific. We also explore regional sequence entropy of the mitochondrial genome.
Plots of mutation signature frequencies illustrate a largely homogeneous mutation profile among mitochondrial mutations. In accordance with convention, we converted each mutation to one of 6 SBS signatures, keeping track of whether it occurred on the light or heavy strand. For example, mutations denoted “G>A L” on the light strand are converted to their complement on the heavy strand, “C>T H”. In the figure below, the vast majority of mutations are guanine-to-adenine substitutions (C>T on the heavy strand), which comprise approximately 70.6% of total mutations. The next most common mutation was thymine-to-cytosine substitutions (T>C on the light strand), which comprised 24.3% of all mutations. All other mutations had less than 5% frequency. As noted in (Yuan et al. (2020)), these extreme strand-specific mutation biases suggest that malfunction of mitochondria-specific DNA replication or repair mechanisms may be responsible for mutation formation.
Examination of these mutation patterns across different cancer types reveals heterogeneity in mutation burden between cancer types. Plotted below are the mutation frequencies per tumor across the 36 different cancer types, faceted by SBS mutation signature. The panels for C>T Heavy and T>C Light clearly have the highest mutational burdens, but there is clear heterogeneity in the frequencies of these mutation signatures across cancer types. Liver Hepatocellular Carcinoma (Liver-HCC) has the highest mutation burden per tumor of all cancer types in both signatures.
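The strand-aware conversion to six SBS classes can be sketched as below. This assumes that the ‘ref’ and ‘var’ columns are reported on the light strand and that the six classes are the pyrimidine-reference substitutions. Both are assumptions, and the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
OTHER_STRAND = {"L": "H", "H": "L"}


def to_sbs(ref, var, strand="L"):
    """Collapse a substitution to a pyrimidine-reference SBS class.

    If the reference base is a purine (A or G), report the complementary
    substitution on the opposite strand, e.g. "G>A L" becomes "C>T H".
    """
    if ref in ("C", "T"):
        return f"{ref}>{var} {strand}"
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[var]} {OTHER_STRAND[strand]}"


if __name__ == "__main__":
    print(to_sbs("G", "A"))  # purine reference: flipped to the heavy strand
    print(to_sbs("T", "C"))  # pyrimidine reference: stays on the light strand
```

Tracking the strand label alongside the six classes preserves the strand asymmetry that would be lost if complementary substitutions were simply merged.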
G>A variants dominate mitochondrial single nucleotide mutations
Examining the mutational landscape of the mitochondrial genome reveals hotspot mutations that are well documented in the literature. The most common mutation (frequency 60) occurs at position 72 in the regulatory D loop. The next most common mutation (frequency 34) occurs in the ND4 gene. The third most common mutation (frequency 26) occurs in the 12S gene.
Regional variation in mutation rate and type has been documented in the nuclear genome of human cancer (Lawrence et al. (2013)). Visualizing the regional heterogeneity of overall mutation density shows that the average mutation frequency is highest from roughly position 15000, through the origin of the circular genome, to position 2000, coinciding with the coordinates of the regulatory D loop (Start: 16024 to End: 576). There is also a minor peak in mutation density between 2500 and 5000 bp, corresponding to the ND1 gene.
We assessed the tissue specificity of hotspot mutations, defined as variants with a mutation frequency greater than or equal to 22 in the dataset, i.e. variants that occur in more than 1% of our samples. 43 variants met this threshold. For each of them, we calculated the proportion of mutations (scaled by the number of tumors in each cancer type category) that belonged to each cancer type. The cancer type specificity of these mutation probabilities was assessed using normalized mutual information.
The bubble plot shown below illustrates that different hotspot mutations show preferences for different cancer types. The most tissue-specific hotspot variant, C>A at position 307, had an NMI of 0.175 and had its highest probabilities in esophageal and liver cancers. The second most specific mutation, G>A at position 16566, is associated with kidney and colorectal cancers. The third most specific, G>A at position 70, is associated with bone/soft tissue, liver, and kidney cancers. These results support the idea that hotspot mutations in the mitochondrial genome may be useful in classifying cancer types.
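One plausible reading of the tumor-scaled proportions is sketched below: each cancer type's count for a variant is divided by the number of tumors of that type, and the resulting rates are normalized to sum to 1. This interpretation, the function name, and the example numbers are assumptions, not the project's exact computation.

```python
def tumor_scaled_proportions(variant_counts, n_tumor):
    """Per-cancer-type share of a variant after scaling by cohort size.

    variant_counts: cancer type -> number of tumors carrying the variant
    n_tumor:        cancer type -> number of tumors of that type
    """
    rates = {c: variant_counts.get(c, 0) / n_tumor[c] for c in n_tumor}
    total = sum(rates.values())
    if total == 0:
        return {c: 0.0 for c in rates}
    return {c: rate / total for c, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical variant seen in 2 tumors of each type.
    counts = {"Liver-HCC": 2, "Kidney": 2}
    cohort = {"Liver-HCC": 10, "Kidney": 40}
    print(tumor_scaled_proportions(counts, cohort))
```

Scaling by cohort size keeps a cancer type with many sequenced tumors from looking enriched simply because it contributes more samples.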
Mitochondrial single nucleotide variants are mostly singleton variants
A plot of the frequencies of frequencies indicates that a majority of variants (64.4% of all SNVs) are singletons, meaning that they are observed only once in the 2177 tumor samples. This proportion is much lower than the proportion of singleton variants observed in the nuclear genome: in The Cancer Genome Atlas, more than 90% of somatic variants were singletons. This finding reinforces that mutational processes in the mitochondrial genome differ from those in the nuclear genome. Still, finding ways to extract clinical information from these rare variants is a worthy task.
Unseen variant estimates
Plotted above are the Chao estimate of the number of unseen mutations given the training mutation frequencies, as well as the Smoothed Good-Toulmin estimates of the number of novel mutations to be discovered if another 2177, 4354, or 21770 tumor samples are sequenced. These estimates suggest that there is considerable value in retaining future mitochondrial genomic reads from human cancers, as we can anticipate discovering thousands of novel single nucleotide mutations that could be useful in understanding carcinogenesis.
These singleton variant frequencies were broken down by genomic feature and cancer type and used to calculate the Smoothed Good-Turing probability, which corresponds to the probability of encountering one or more previously uncatalogued (novel) variants in a particular gene in a future sequenced tumor. These probabilities were calculated for each tissue type to assess whether the unseen variant probabilities of any genes carry tissue-specific signals. Bubble plots of the unseen variant probabilities show variability between tissue types, suggesting that they may contain gene- and tissue-specific signals.
The significance of these unseen mutation probabilities was evaluated using a permutation test. NMI was used to measure the cancer-type specificity of each gene’s unseen variant probabilities (here, the two variables are mutation probability and cancer label). A null distribution of NMIs was computed by permuting the tissue labels of all mutations 1000 times, in effect erasing any tissue-variant relationship. The NMI distribution produced by the real variant-tissue allocations (blue) fell below the 95th quantile of the null distribution (red), indicating that any apparent tissue specificity of the unseen variant probabilities was not statistically significant.
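The label-permutation procedure can be sketched generically in Python. The `statistic` argument stands in for whatever specificity measure is used (NMI in this project). The function names, the fixed seed, and the simple empirical quantile rule are assumptions made for this example.

```python
import random
from math import ceil


def permutation_null(values, labels, statistic, n_perm=1000, seed=0):
    """Null distribution of statistic(values, labels) under shuffled labels.

    Shuffling the labels breaks any real association between values and
    labels while preserving both marginal distributions.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(statistic(values, shuffled))
    return null


def exceeds_null_quantile(observed, null, level=0.95):
    """True if the observed statistic lies above the given null quantile."""
    ordered = sorted(null)
    cutoff = ordered[min(len(ordered) - 1, ceil(level * len(ordered)) - 1)]
    return observed > cutoff


if __name__ == "__main__":
    # Toy data with a perfect value-label association.
    values = ["x"] * 50 + ["y"] * 50
    labels = ["A"] * 50 + ["B"] * 50

    def agreement(vals, labs):
        return sum((v == "x") == (lab == "A") for v, lab in zip(vals, labs))

    null = permutation_null(values, labels, agreement)
    print(exceeds_null_quantile(agreement(values, labels), null))
```

An observed statistic that does not exceed the 95th quantile of the null distribution is consistent with there being no association between the values and the labels.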
We hypothesized that there may be a link between mutation frequency and sequence entropy. We developed a sliding-window approach, which slides a window of fixed length along the genome and calculates the Shannon entropy of each window from the frequencies of A, C, T, and G within it. We also developed a Shiny application to visualize these entropies. The resulting plot of the mitochondrial genome appears to show regular fluctuations in Shannon entropy, suggesting that complex genomic regions may be evenly spaced along the genome. Visual comparison of the regional mutation density and sequence entropy plots does not reveal a clear relationship. Further investigation is warranted.
This research was conducted as a final project for the Data Mining course at Claremont McKenna College. Thanks to Mike Izbicki for his guidance and edits to this project.
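A minimal Python sketch of the sliding-window entropy calculation is shown below. The window length, step size, and optional wrap-around for the circular genome are illustrative parameters, not the settings used in the Shiny application.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits) of the base composition of seq."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())


def sliding_entropy(genome, window, step=1, circular=False):
    """Return (start, entropy) for each window along the genome.

    With circular=True, windows near the end wrap around to the start,
    which may be appropriate for a circular genome such as mtDNA.
    """
    seq = genome + genome[: window - 1] if circular else genome
    last_start = len(genome) - 1 if circular else len(genome) - window
    return [(start, shannon_entropy(seq[start:start + window]))
            for start in range(0, last_start + 1, step)]


if __name__ == "__main__":
    toy = "AAAACGT"  # hypothetical sequence
    for start, h in sliding_entropy(toy, window=4):
        print(start, round(h, 3))
```

With four bases, the entropy of a window ranges from 0 bits (a single repeated base) to 2 bits (all four bases equally frequent).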
Brandon, M, P Baldi, and D C Wallace. 2006. “Mitochondrial mutations in cancer.” Oncogene 25: 4647–62. https://doi.org/10.1038/sj.onc.1209607.
Chakraborty, Saptarshi, Arshi Arora, Colin B. Begg, and Ronglai Shen. 2019. “Using somatic variant richness to mine signals from rare variants in the cancer genome.” Nature Communications 10: 1–9. https://doi.org/10.1038/s41467-019-13402-z.
Gale, William A, and Murray Hill. 1995. “Good-Turing Smoothing Without Tears.” Journal of Quantitative Linguistics 2: 1–24.
Hanahan, Douglas, and Robert A Weinberg. 2011. “Hallmarks of Cancer: The Next Generation.” Cell 144 (5): 646–74. https://doi.org/10.1016/j.cell.2011.02.013.
Hertweck, Kate, and Santanu DesDasgupta. 2017. “The Landscape of mtDNA Modifications in Cancer: A Tale of Two Cities.” Frontiers in Oncology 7 (November): 1–12. https://doi.org/10.3389/fonc.2017.00262.
Lawrence, Michael S, Petar Stojanov, Paz Polak, Gregory V Kryukov, Kristian Cibulskis, Andrey Sivachenko, Scott L Carter, et al. 2013. “Mutational heterogeneity in cancer and the search for new cancer-associated genes.” Nature 499: 214–18. https://doi.org/10.1038/nature12213.
Orlitsky, Alon, Ananda Theertha, and Yihong Wu. 2016. “Optimal prediction of the number of unseen species.” PNAS 113 (47): 13283–8. https://doi.org/10.1073/pnas.1607774113.
Yuan, Yuan, Young Seok Ju, Youngwook Kim, Jun Li, Yumeng Wang, Christopher J Yoon, Yang Yang, et al. 2020. “Complete molecular characterization of mitochondrial genomes in human cancers.” Nature Genetics 52 (March): 342–52. https://doi.org/10.1038/s41588-019-0557-x.
Zong, Wei-xing, Joshua D Rabinowitz, and Eileen White. 2016. “Mitochondria and Cancer.” Molecular Cell 61 (5): 667–76. https://doi.org/10.1016/j.molcel.2016.02.011.